AITopics | audio data

audio data

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, here is the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

How Can I Explain This to You? An Empirical Study of Deep Neural Network Explanation Methods

Neural Information Processing Systems, Feb-7-2026, 22:44:20 GMT

Although many of these toolkits are available for use, it is unclear which style of explanation is preferred by end-users, thereby demanding investigation.

artificial intelligence, explanation, machine learning, (19 more...)

Country:

North America > United States > California (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Asia > British Indian Ocean Territory > Diego Garcia (0.04)

Industry: Health & Medicine > Diagnostic Medicine (0.46)

Technology:

Information Technology > Communications (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Threat Modeling for Enhancing Security of IoT Audio Classification Devices under a Secure Protocols Framework

Benlloch-Lopez, Sergio, Viel-Vazquez, Miquel, Naranjo-Alcazar, Javier, Grau-Haro, Jordi, Zuccarello, Pedro

arXiv.org Artificial Intelligence, Nov-17-2025

The rapid proliferation of IoT nodes equipped with microphones and capable of performing on-device audio classification exposes highly sensitive data while operating under tight resource constraints. To protect against this, we present a defence-in-depth architecture comprising a security protocol that treats the edge device, cellular network and cloud backend as three separate trust domains, linked by TPM-based remote attestation and mutually authenticated TLS 1.3. A STRIDE-driven threat model and attack-tree analysis guide the design. At startup, each boot stage is measured into TPM PCRs. The node can only decrypt its LUKS-sealed partitions after the cloud has verified a TPM quote and released a one-time unlock key. This ensures that rogue or tampered devices remain inert. Data in transit is protected by TLS 1.3 and hybridised with Kyber and Dilithium to provide post-quantum resilience. Meanwhile, end-to-end encryption and integrity hashes safeguard extracted audio features. Signed, rollback-protected AI models and tamper-responsive sensors harden firmware and hardware. Data at rest follows a 3-2-1 strategy comprising a solid-state drive sealed with LUKS, an offline cold archive encrypted with a hybrid post-quantum cipher and an encrypted cloud replica. Finally, we set out a plan for evaluating the physical and logical security of the proposed protocol.
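The measured-boot step in this abstract can be illustrated with a small sketch. It assumes a SHA-256 PCR bank and the standard TPM extend rule (new PCR = hash of old PCR concatenated with the measurement digest). The stage names and the helper functions `extend` and `measure_boot_chain` are hypothetical, and nothing here talks to a real TPM or reproduces the authors' protocol.

```python
import hashlib

PCR_SIZE = 32  # bytes in a SHA-256 PCR bank (assumed bank)


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """Return the PCR value after extending it with the digest of `measurement`."""
    digest = hashlib.sha256(measurement).digest()
    # TPM extend rule: the new value depends on the old value AND the new digest.
    return hashlib.sha256(pcr + digest).digest()


def measure_boot_chain(stages) -> bytes:
    """Fold every boot stage into one PCR, starting from the all-zero reset value."""
    pcr = bytes(PCR_SIZE)
    for image in stages:
        pcr = extend(pcr, image)
    return pcr


# Hypothetical stage images; a real device would hash the actual binaries.
GOOD = [b"bootloader-v1", b"kernel-v1", b"rootfs-v1"]
expected = measure_boot_chain(GOOD)
tampered = measure_boot_chain([b"bootloader-v1", b"kernel-EVIL", b"rootfs-v1"])

# A verifier comparing the reported PCR against `expected` would reject the
# tampered chain and withhold the unlock key.
print(tampered == expected)
```

Because each value is chained into the next, changing or reordering any stage changes the final PCR, which is what lets a remote verifier refuse to release a key to a modified device.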

artificial intelligence, data mining, machine learning, (15 more...)

2509.14657

Country: Europe (0.28)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Internet of Things (1.00)
Information Technology > Data Science > Data Mining (1.00)
(2 more...)

WavInWav: Time-domain Speech Hiding via Invertible Neural Network

Fan, Wei, Chen, Kejiang, Wang, Xiangkun, Zhang, Weiming, Yu, Nenghai

arXiv.org Artificial Intelligence, Oct-6-2025

Data hiding is essential for secure communication across digital media, and recent advances in Deep Neural Networks (DNNs) provide enhanced methods for embedding secret information effectively. However, previous audio hiding methods often result in unsatisfactory quality when recovering secret audio, due to their inherent limitations in the modeling of time-frequency relationships. In this paper, we explore these limitations and introduce a new DNN-based approach. We use a flow-based invertible neural network to establish a direct link between stego audio, cover audio, and secret audio, enhancing the reversibility of embedding and extracting messages. To address common issues from time-frequency transformations that degrade secret audio quality during recovery, we implement a time-frequency loss on the time-domain signal. This approach not only retains the benefits of time-frequency constraints but also enhances the reversibility of message recovery, which is vital for practical applications. We also add an encryption technique to protect the hidden data from unauthorized access. Experimental results on the VCTK and LibriSpeech datasets demonstrate that our method outperforms previous approaches in terms of subjective and objective metrics and exhibits robustness to various types of noise, suggesting its utility in targeted secure communication scenarios.
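Why an invertible network suits hiding and recovery can be seen in a toy additive coupling step, sketched below. This is a generic illustration and not the WavInWav architecture: the functions `f` and `g` are hypothetical stand-ins for learned subnetworks, and the sketch keeps the auxiliary output `r` so that the inverse is exact.

```python
import math


def f(x):
    """Hypothetical stand-in for a learned subnetwork."""
    return [0.1 * math.tanh(v) for v in x]


def g(x):
    """Second hypothetical subnetwork."""
    return [0.2 * math.sin(v) for v in x]


def hide(cover, secret):
    """Forward pass: mix the secret into the cover with additive coupling."""
    stego = [c + d for c, d in zip(cover, f(secret))]
    r = [s + d for s, d in zip(secret, g(stego))]
    return stego, r


def reveal(stego, r):
    """Inverse pass: undo the two coupling steps in reverse order."""
    secret = [v - d for v, d in zip(r, g(stego))]
    cover = [v - d for v, d in zip(stego, f(secret))]
    return cover, secret


cover = [0.1, -0.3, 0.7]
secret = [0.5, 0.2, -0.4]
stego, r = hide(cover, secret)
cover_back, secret_back = reveal(stego, r)
```

The inverse never needs to invert `f` or `g` themselves; it only re-evaluates them and subtracts, so recovery is exact up to floating-point rounding regardless of how complex the subnetworks are.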

artificial intelligence, audio, machine learning, (19 more...)

2510.02915

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Step-Audio 2 Technical Report

Wu, Boyong, Yan, Chao, Hu, Chen, Yi, Cheng, Feng, Chengli, Tian, Fei, Shen, Feiyu, Yu, Gang, Zhang, Haoyang, Li, Jingbei, Chen, Mingrui, Liu, Peng, You, Wang, Zhang, Xiangyu Tony, Li, Xingyuan, Yang, Xuerui, Deng, Yayue, Huang, Yechang, Li, Yuxin, Zhang, Yuxin, You, Zhao, Li, Brian, Wan, Changyi, Hu, Hanpeng, Zhen, Jiangjie, Chen, Siyu, Yuan, Song, Zhang, Xuelin, Jiang, Yimin, Zhou, Yu, Yang, Yuxiang, Li, Bingxin, Ma, Buyun, Song, Changhe, Pang, Dongqing, Hu, Guoqiang, Sun, Haiyang, An, Kang, Wang, Na, Gao, Shuli, Ji, Wei, Li, Wen, Sun, Wen, Wen, Xuan, Ren, Yong, Ma, Yuankai, Lu, Yufan, Wang, Bin, Li, Bo, Miao, Changxin, Liu, Che, Xu, Chen, Shi, Dapeng, Hu, Dingyuan, Wu, Donghang, Liu, Enle, Huang, Guanzhe, Yan, Gulin, Zhang, Han, Nie, Hao, Jia, Haonan, Zhou, Hongyu, Sun, Jianjian, Wu, Jiaoren, Wu, Jie, Yang, Jie, Yang, Jin, Lin, Junzhe, Li, Kaixiang, Yang, Lei, Shi, Liying, Zhou, Li, Gu, Longlong, Li, Ming, Li, Mingliang, Li, Mingxiao, Wu, Nan, Han, Qi, Tan, Qinyuan, Pang, Shaoliang, Fan, Shengjie, Liu, Siqi, Cao, Tiancheng, Lu, Wanying, He, Wenqing, Xie, Wuxun, Zhao, Xu, Li, Xueqi, Yu, Yanbo, Yang, Yang, Liu, Yi, Lu, Yifan, Wang, Yilei, Ding, Yuanhao, Liang, Yuanwei, Lu, Yuanwei, Luo, Yuchu, Yin, Yuhe, Zhan, Yumeng, Zhang, Yuxiang, Yang, Zidong, Zhang, Zixin, Jiao, Binxing, Jiang, Daxin, Shum, Heung-Yeung, Chen, Jiansheng, Li, Jing, Zhang, Xiangyu, Zhu, Yibo

arXiv.org Artificial Intelligence, Aug-28-2025

This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

large language model, machine learning, natural language, (15 more...)

2507.16632

Country: Asia (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information

Wang, Rui, Sun, Qianguo, Chen, Tianrong, Zeng, Zhiyun, Wu, Junlong, Zhang, Jiaxing

arXiv.org Artificial Intelligence, May-26-2025

The emergence of multi-codebook neural audio codecs such as Residual Vector Quantization (RVQ) and Group Vector Quantization (GVQ) has significantly advanced Large-Language-Model (LLM) based Text-to-Speech (TTS) systems. These codecs are crucial in separating semantic and acoustic information while efficiently harnessing semantic priors. However, since semantic and acoustic information cannot be fully aligned, a significant drawback of these methods when applied to LLM-based TTS is that large language models may have limited access to comprehensive audio information. To address this limitation, we propose DistilCodec and UniTTS, which collectively offer the following advantages: 1) This method can distill a multi-codebook audio codec into a single-codebook audio codec with 32,768 codes while achieving near-100% utilization. 2) As DistilCodec does not employ a semantic alignment scheme, a large amount of high-quality unlabeled audio (such as audiobooks with sound effects, songs, etc.) can be incorporated during training, further expanding data diversity and broadening its applicability. 3) Leveraging the comprehensive audio information modeling of DistilCodec, we integrated three key tasks into UniTTS's pre-training framework: audio modality autoregression, text modality autoregression, and speech-text cross-modal autoregression. This allows UniTTS to accept interleaved text and speech/audio prompts while substantially preserving the LLM's text capabilities. 4) UniTTS employs a three-stage training process: Pre-Training, Supervised Fine-Tuning (SFT), and Alignment. Source code and model checkpoints are publicly available at https://github.com/IDEA-Emdoor-Lab/UniTTS and https://github.com/IDEA-Emdoor-Lab/DistilCodec.
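For readers unfamiliar with the multi-codebook codecs this abstract starts from, here is a minimal sketch of residual vector quantization: each codebook quantizes whatever the previous one left over. The two tiny codebooks and the function names are hypothetical toy values, not DistilCodec's, and real codecs learn their codebooks from data.

```python
def nearest(codebook, vec):
    """Index of the codeword with the smallest squared distance to `vec`."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize `vec` stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Toy codebooks (assumed values): a coarse stage and a finer correction stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

codes = rvq_encode([1.2, 0.9], CODEBOOKS)
approx = rvq_decode(codes, CODEBOOKS)
```

A frame therefore becomes one index per codebook, which is the multi-token representation that a single-codebook codec collapses into one index per frame.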

distilcodec, large language model, natural language, (14 more...)

2505.17426

Genre: Research Report > New Finding (0.67)

Industry:

Education (0.46)
Media (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Edge Intelligence for Wildlife Conservation: Real-Time Hornbill Call Classification Using TinyML

Hing, Kong Ka, Behjati, Mehran

arXiv.org Artificial Intelligence, Apr-17-2025

Hornbills, an iconic species of Malaysia's biodiversity, face threats from habitat loss, poaching, and environmental changes, necessitating accurate and real-time population monitoring that is traditionally challenging and resource-intensive. The emergence of Tiny Machine Learning (TinyML) offers a chance to transform wildlife monitoring by enabling efficient, real-time data analysis directly on edge devices. Addressing the challenge of wildlife conservation, this research paper explores the pivotal role of machine learning, specifically TinyML, in the classification and monitoring of hornbill calls in Malaysia. Leveraging audio data from the Xeno-canto database, the study aims to develop a speech recognition system capable of identifying and classifying hornbill vocalizations. The proposed methodology involves preprocessing the audio data, extracting features using Mel-Frequency Energy (MFE), and deploying the model on an Arduino Nano 33 BLE, which is adept at edge computing. The research encompasses foundational work, including a comprehensive introduction, literature review, and methodology. The model is trained using Edge Impulse and validated through real-world tests, achieving high accuracy in hornbill species identification. The project underscores the potential of TinyML for environmental monitoring and its broader application in ecological conservation efforts, contributing to both the field of TinyML and wildlife conservation.
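The feature-extraction step named in this abstract can be sketched generically as log mel-filterbank energies for one audio frame. This is not Edge Impulse's implementation: the frame length, sample rate, filter count, and the common 2595·log10(1 + f/700) mel formula are all assumptions, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (fine for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))) ** 2 / n
        for k in range(n // 2 + 1)
    ]


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * bin_hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfe(frame, sample_rate, num_filters=8):
    """Log energy captured by each mel filter for a single frame."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    return [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10) for row in bank]


# Assumed test signal: a 1 kHz tone sampled at 8 kHz, one 256-sample frame.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]
energies = mfe(tone, SAMPLE_RATE)
```

A classifier on a microcontroller would consume a sequence of such per-frame vectors; the small, fixed-size output is part of what makes this kind of feature practical on constrained hardware.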

artificial intelligence, machine learning, real time system, (19 more...)

doi: 10.1007/978-981-96-3949-6_40

2504.12272

Country: Asia > Malaysia (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Architecture > Real Time Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)

FinAudio: A Benchmark for Audio Large Language Models in Financial Applications

Cao, Yupeng, Li, Haohang, Yu, Yangyang, Javaji, Shashidhar Reddy, He, Yueru, Huang, Jimin, Zhu, Zining, Xie, Qianqian, Liu, Xiao-yang, Subbalakshmi, Koduvayur, Qiu, Meikang, Ananiadou, Sophia, Nie, Jian-Yun

arXiv.org Artificial Intelligence, Mar-26-2025

Audio Large Language Models (AudioLLMs) have received widespread attention and have significantly improved performance on audio tasks such as conversation, audio understanding, and automatic speech recognition (ASR). Despite these advancements, there is an absence of a benchmark for assessing AudioLLMs in financial scenarios, where audio data, such as earnings conference calls and CEO speeches, are crucial resources for financial analysis and investment decisions. In this paper, we introduce FinAudio, the first benchmark designed to evaluate the capacity of AudioLLMs in the financial domain. We first define three tasks based on the unique characteristics of the financial domain: 1) ASR for short financial audio, 2) ASR for long financial audio, and 3) summarization of long financial audio. Then, we curate two short and two long audio datasets, respectively, and develop a novel dataset for financial audio summarization, comprising the FinAudio benchmark. Then, we evaluate seven prevalent AudioLLMs on FinAudio. Our evaluation reveals the limitations of existing AudioLLMs in the financial domain and offers insights for improving AudioLLMs. All datasets and code will be released.

large language model, machine learning, natural language, (20 more...)

2503.2099

Country:

North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Synthetic Audio Helps for Cognitive State Tasks

Soubki, Adil, Murzaku, John, Zeng, Peter, Rambow, Owen

arXiv.org Artificial Intelligence, Feb-10-2025

The NLP community has broadly focused on text-only approaches to cognitive state tasks, but audio can provide vital missing cues through prosody. We posit that text-to-speech models learn to track aspects of cognitive state in order to produce naturalistic audio, and that the signal audio models implicitly identify is orthogonal to the information that language models exploit. We present Synthetic Audio Data fine-tuning (SAD), a framework in which we show that 7 tasks related to cognitive state modeling benefit from multimodal training on both text and zero-shot synthetic audio data from an off-the-shelf TTS system. We show an improvement over the text-only modality when adding synthetic audio data to text-only corpora. Furthermore, on tasks and corpora that do contain gold audio, we show that our SAD framework with text and synthetic audio achieves performance competitive with text and gold audio.

computational linguistic, large language model, machine learning, (20 more...)

2502.06922

Country: